NSF PAR Search | NSF Public Access Repository

Accurate short-read alignment through r-index-based pangenome indexing

https://doi.org/10.1101/gr.279858.124

Varki, Rahul; Rossi, Massimiliano; Ferro, Eddie; Oliva, Marco; Garrison, Erik; Langmead, Ben; Boucher, Christina (June 2025, Genome Research)

Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, where r is the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with the r-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability.
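As a rough illustration of the quantity r mentioned in this abstract, the following sketch computes a Burrows–Wheeler transform by sorting rotations and counts its runs. It is a toy, quadratic-time construction written only for this listing, not the method used by Moni-align; the function names `bwt` and `count_runs` and the example strings are assumptions made for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy, quadratic)."""
    # '$' is a sentinel assumed absent from text; it sorts before A, C, G, T.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters, i.e. r when s is a BWT."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


if __name__ == "__main__":
    single = "GATTACA"
    repeated = single * 20  # stand-in for a highly repetitive collection
    for name, text in (("single", single), ("repeated", repeated)):
        transformed = bwt(text)
        print(name, "n =", len(transformed), "r =", count_runs(transformed))
```

On repetitive input the run count stays far below the text length, which is the property an O(r)-space index relies on.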
Full Text Available
Pfp-fm: an accelerated FM-index

https://doi.org/10.1186/s13015-024-00260-8

Hong, Aaron; Oliva, Marco; Köppl, Dominik; Bannai, Hideo; Boucher, Christina; Gagie, Travis (December 2024, Algorithms for Molecular Biology)

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing, which takes parameters that let us tune the average length of the phrases, instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38, and is consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, it seems our method accelerates the performance of count over all state-of-the-art methods with a moderate increase in the memory. The source code for PFP-FM is available at https://github.com/AaronHong1024/afm.
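To make the "one random access per character" remark concrete, here is a minimal sketch of a standard character-based FM-index counting query (backward search). It is the textbook baseline, not PFP-FM; the names `build_fm_index` and `count_occurrences` are hypothetical, and the occurrence table is stored uncompressed for readability.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build C and Occ tables for a toy FM-index ('$' must not occur in text)."""
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT character = character preceding each sorted suffix (s[-1] is '$').
    bwt = "".join(s[i - 1] for i in suffix_array)
    counts = Counter(s)
    # first_row[c] = number of characters in s strictly smaller than c.
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first_row, occ, len(s)


def count_occurrences(pattern: str, first_row, occ, n: int) -> int:
    """Backward search: one pair of Occ lookups per pattern character."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("GATTACATTAC")
    print(count_occurrences("TTA", *index))
```

Each loop iteration touches the occurrence table at two unrelated positions, which is the per-character cost that word-based indexes aim to reduce.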
Full Text Available
Building a pangenome alignment index via recursive prefix-free parsing

https://doi.org/10.1016/j.isci.2024.110933

Ferro, Eddie; Oliva, Marco; Gagie, Travis; Boucher, Christina (October 2024, iScience)

Full Text Available
ONeSAMP 3.0: estimation of effective population size via single nucleotide polymorphism data from one population

https://doi.org/10.1093/g3journal/jkae153

Hong, Aaron; Cheek, Rebecca G; De Silva, Suhashi Nihara; Mukherjee, Kingshuk; Yooseph, Isha; Oliva, Marco; Heim, Mark; W Funk, Chris; Tallmon, David; Boucher, Christina (July 2024, G3: Genes, Genomes, Genetics)
Myers, C (Ed.)
The genetic effective size (Ne) is arguably one of the most important characteristics of a population as it impacts the rate of loss of genetic diversity. Methods that estimate Ne are important in population and conservation genetic studies as they quantify the risk of a population being inbred or lacking genetic diversity. Yet there are very few methods that can estimate the Ne from data from a single population and without extensive information about the genetics of the population, such as a linkage map, or a reference genome of the species of interest. We present ONeSAMP 3.0, an algorithm for estimating Ne from single nucleotide polymorphism data collected from a single population sample using approximate Bayesian computation and local linear regression. We demonstrate the utility of this approach using simulated Wright–Fisher populations, and empirical data from five endangered Channel Island fox (Urocyon littoralis) populations to evaluate the performance of ONeSAMP 3.0 compared to a commonly used Ne estimator. Our results show that ONeSAMP 3.0 is broadly applicable to natural populations and is flexible enough that future versions could easily include summary statistics appropriate for a suite of biological and sampling conditions. ONeSAMP 3.0 is publicly available under the GNU General Public License at https://github.com/AaronHong1024/ONeSAMP_3.
Full Text Available
Recursive Prefix-Free Parsing for Building Big BWTs

https://doi.org/10.1109/DCC55655.2023.00014

Oliva, Marco; Gagie, Travis; Boucher, Christina (March 2023, IEEE Data Compression Conference)

Full Text Available
Target-enriched long-read sequencing (TELSeq) contextualizes antimicrobial resistance genes in metagenomes

https://doi.org/10.1186/s40168-022-01368-y

Slizovskiy, Ilya B.; Oliva, Marco; Settle, Jonathen K.; Zyskina, Lidiya V.; Prosperi, Mattia; Boucher, Christina; Noyes, Noelle R. (December 2022, Microbiome)

Background: Metagenomic data can be used to profile high-importance genes within microbiomes. However, current metagenomic workflows produce data that suffer from low sensitivity and an inability to accurately reconstruct partial or full genomes, particularly those in low abundance. These limitations preclude colocalization analysis, i.e., characterizing the genomic context of genes and functions within a metagenomic sample. Genomic context is especially crucial for functions associated with horizontal gene transfer (HGT) via mobile genetic elements (MGEs), for example antimicrobial resistance (AMR). To overcome this current limitation of metagenomics, we present a method for comprehensive and accurate reconstruction of antimicrobial resistance genes (ARGs) and MGEs from metagenomic DNA, termed target-enriched long-read sequencing (TELSeq). Results: Using technical replicates of diverse sample types, we compared TELSeq performance to that of non-enriched PacBio and short-read Illumina sequencing. TELSeq achieved much higher ARG recovery (>1,000-fold) and sensitivity than the other methods across diverse metagenomes, revealing an extensive resistome profile comprising many low-abundance ARGs, including some with public health importance. Using the long reads generated by TELSeq, we identified numerous MGEs and cargo genes flanking the low-abundance ARGs, indicating that these ARGs could be transferred across bacterial taxa via HGT. Conclusions: TELSeq can provide a nuanced view of the genomic context of microbial resistomes and thus has wide-ranging applications in public, animal, and human health, as well as environmental surveillance and monitoring of AMR. Thus, this technique represents a fundamental advancement for microbiome research and application.
Full Text Available
MONI: A Pangenomic Index for Finding Maximal Exact Matches

https://doi.org/10.1089/cmb.2021.0290

Rossi, Massimiliano; Oliva, Marco; Langmead, Ben; Gagie, Travis; Boucher, Christina (February 2022, Journal of Computational Biology)

Full Text Available
CSTs for Terabyte-Sized Data

https://doi.org/10.1109/DCC52660.2022.00017

Oliva, Marco; Cenzato, Davide; Rossi, Massimiliano; Liptak, Zsuzsanna; Gagie, Travis; Boucher, Christina (March 2022, Data Compression Conference (DCC))

Generating pangenomic datasets is becoming increasingly common but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present a method, which we refer to as RePFP-CST, for building CSTs in a manner that is scalable. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and to prune from the prefix-free parse (PFP) phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.
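For readers unfamiliar with the dictionary and parse mentioned above, the following is a simplified sketch of prefix-free parsing, not the RePFP-CST construction. It assumes a window length `w` and modulus `p` as tuning parameters, recomputes a simple polynomial hash per window instead of using a rolling hash, and pads the text with `w` copies of `$` (assumed absent from the input). All function names are hypothetical.

```python
def window_hash(window: str) -> int:
    """Deterministic polynomial hash (Python's built-in hash() is salted)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_000_007
    return h


def prefix_free_parse(text: str, w: int = 4, p: int = 10):
    """Split a non-empty text into overlapping phrases ending at trigger windows."""
    s = text + "$" * w  # padding so the final window always closes a phrase
    phrases, start = [], 0
    for i in range(1, len(s) - w + 1):
        is_last = i == len(s) - w
        if is_last or window_hash(s[i:i + w]) % p == 0:
            phrases.append(s[start:i + w])  # phrase includes the trigger window
            start = i                       # next phrase overlaps by w characters
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(phrase) for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w: int = 4) -> str:
    """Invert the parse: drop the w-character overlap, then the padding."""
    out = dictionary[parse[0]]
    for rank in parse[1:]:
        out += dictionary[rank][w:]
    return out[:-w]


if __name__ == "__main__":
    text = "GATTACAGATTACAGATTACAGATTTACA"
    dictionary, parse = prefix_free_parse(text, w=2, p=3)
    print(len(dictionary), "distinct phrases;", len(parse), "phrases in the parse")
    assert reconstruct(dictionary, parse, w=2) == text
```

Increasing `p` makes trigger windows rarer and phrases longer on average, which is the kind of tuning the abstracts in this listing refer to.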
Full Text Available
Finding Maximal Exact Matches Using the r-Index

https://doi.org/10.1089/cmb.2021.0445

Rossi, Massimiliano; Oliva, Marco; Bonizzoni, Paola; Langmead, Ben; Gagie, Travis; Boucher, Christina (February 2022, Journal of Computational Biology)

Full Text Available
AMR-meta: a k-mer and metafeature approach to classify antimicrobial resistance from high-throughput short-read metagenomics data

https://doi.org/10.1093/gigascience/giac029

Marini, Simone; Oliva, Marco; Slizovskiy, Ilya B; Das, Rishabh A; Noyes, Noelle Robertson; Kahveci, Tamer; Boucher, Christina; Prosperi, Mattia (January 2022, GigaScience)

Background: Antimicrobial resistance (AMR) is a global health concern. High-throughput metagenomic sequencing of microbial samples enables profiling of AMR genes through comparison with curated AMR databases. However, the performance of current methods is often hampered by database incompleteness and the presence of homology/homoplasy with other non-AMR genes in sequenced samples. Results: We present AMR-meta, a database-free and alignment-free approach, based on k-mers, which combines algebraic matrix factorization into metafeatures with regularized regression. Metafeatures capture multi-level gene diversity across the main antibiotic classes. AMR-meta takes in reads from metagenomic shotgun sequencing and outputs predictions about whether those reads contribute to resistance against specific classes of antibiotics. In addition, AMR-meta uses an augmented training strategy that joins an AMR gene database with non-AMR genes (used as negative examples). We compare AMR-meta with AMRPlusPlus, DeepARG, and Meta-MARC, further testing their ensemble via a voting system. In cross-validation, AMR-meta has a median f-score of 0.7 (interquartile range, 0.2–0.9). On semi-synthetic metagenomic data (the external test), AMR-meta yields on average a 1.3-fold hit rate increase over existing methods. In terms of run-time, AMR-meta is 3 times faster than DeepARG, 30 times faster than Meta-MARC, and as fast as AMRPlusPlus. Finally, we note that differences in AMR ontologies and observed variance of all tools in classification outputs call for further development on standardization of benchmarking data and protocols. Conclusions: AMR-meta is a fast, accurate classifier that exploits non-AMR negative sets to improve sensitivity and specificity. The differences in AMR ontologies and the high variance of all tools in classification outputs call for the deployment of standard benchmarking data and protocols, to fairly compare AMR prediction tools.
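As a small illustration of what "based on k-mers" means for a read, the sketch below turns a sequence into k-mer counts. It covers only this first feature-extraction step under assumed names (`kmer_counts`, the example read); the matrix factorization and regularized regression described in the abstract are not reproduced here.

```python
from collections import Counter


def kmer_counts(read: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a read."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


if __name__ == "__main__":
    # Hypothetical short read used only for demonstration.
    counts = kmer_counts("GATTACAGATTA", 3)
    for kmer, n in sorted(counts.items()):
        print(kmer, n)
```

A read of length n yields n - k + 1 overlapping k-mers, so no alignment against a reference database is needed to compute these features.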
Full Text Available
